Welcome to Tidyverse

ANU BDSI
workshop
Data Wrangling with R Part 1

Emi Tanaka

Biological Data Science Institute

8th April 2024

Current learning objectives

  • -Recognize the characteristics of tidy data
  • Differentiate between the Base and Tidyverse paradigms
  • Acquire the skills to add/modify columns, subset data by rows and columns, rename column names, and perform group operations using dplyr
  • -Pivot data into longer or wider format using tidyr
  • -Join datasets using dplyr

Base R

  • R has 7 packages (base, stats, graphics, grDevices, utils, datasets and methods), collectively referred to as “base R”, that are loaded automatically when you launch it.
  • The functions in the base packages are generally well-tested and trustworthy.

Tidyverse

  • Tidyverse refers to a collection of R-packages that share a common (opinionated) design philosophy, grammar and data structure.
  • This trains your mental model to do data science tasks in a manner which may make it easier, faster, and/or fun for you to do these tasks.
  • library(tidyverse) is a shorthand for loading the 9 core tidyverse packages.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data wrangling with Base R

data.frame (Base R)

  • In R, data.frame is a special class of list.
  • Every column in the data.frame is a vector of the same length.
  • Each entry in a vector has the same type, e.g. logical, integer, double (or numeric), character or factor.
  • It has an attribute row.names which could be a sequence of integers indexing the row number or some unique identifier.
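The points above can be checked directly. This is a minimal sketch; the small data frame `df` and its column names are made up for illustration.

```r
# A toy data.frame (hypothetical values, for illustration only)
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(2.5, 3.0, 4.5))

typeof(df)              # the underlying storage is a list
class(df)               # with class "data.frame"
lengths(df)             # every column has the same length
sapply(df, class)       # each column has a single type
attr(df, "row.names")   # integer row index by default
```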

Subsetting by column (Base R)

  • Remember, a data.frame is just a special type of list, so it inherits the methods that apply to lists.
  • A data.frame can also be accessed by array index:
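A short sketch of both access styles, using the built-in mtcars data (the variable names on the left are only for illustration):

```r
# List-style access
mpg_a <- mtcars$mpg          # vector
mpg_b <- mtcars[["mpg"]]     # vector
mpg_c <- mtcars["mpg"]       # data.frame with one column

# Array-style access: [rows, columns]
by_name  <- mtcars[, c("mpg", "cyl")]
by_index <- mtcars[, 1:2]
```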

Subsetting by column (Base R)

  • If you subset a single column using [, ], then by default the output is a vector and not a data.frame.
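To keep a data.frame when selecting one column by array index, the `drop` argument of `[` can be set to FALSE. A minimal sketch with mtcars:

```r
v <- mtcars[, "mpg"]                # default drop = TRUE: a numeric vector
d <- mtcars[, "mpg", drop = FALSE]  # stays a data.frame with one column

class(v)
class(d)
```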

Subsetting by row (Base R)
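A sketch of some common ways to subset rows in base R, using mtcars as an assumed example dataset:

```r
first3   <- mtcars[1:3, ]                # by row position
four_cyl <- mtcars[mtcars$cyl == 4, ]    # by logical condition
fiat     <- mtcars["Fiat 128", ]         # by row name
thrifty  <- subset(mtcars, mpg > 30)     # subset() evaluates the condition inside the data
```

Note the trailing comma in `[rows, ]`: leaving the column slot empty keeps all columns.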

Adding or modifying a column (Base R)
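One possible base R approach, sketched on a copy of mtcars (the copy `cars` and the new column `gpm` are names chosen for illustration):

```r
cars <- mtcars                        # work on a copy

# Add a new column by assignment
cars$gpm <- 1 / cars$mpg

# Modify an existing column in place
cars[["wt"]] <- cars[["wt"]] * 1000   # wt is recorded in units of 1000 lbs

# transform() returns a modified copy
cars <- transform(cars, am = ifelse(am == 0, "Automatic", "Manual"))
```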

Adding a row (Base R)
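A minimal sketch using `rbind()`. The data frame `df` and the new entry are hypothetical; the new entry deliberately lists its columns in a different order.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# New entry with columns given in a different order
new_row <- data.frame(name = "d", id = 4L)

out <- rbind(df, new_row)   # data.frame columns are matched by name
out
```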

What do you notice with the order of the new entry?

Sorting columns (Base R)
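Reordering columns in base R amounts to subsetting with the columns in the desired order. A sketch with mtcars:

```r
# Alphabetical column order
by_name <- mtcars[, sort(names(mtcars))]

# Move one column (wt) to the front, keep the rest as they were
wt_first <- mtcars[, c("wt", setdiff(names(mtcars), "wt"))]
```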

Sorting rows (Base R)
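Rows are typically sorted by passing the result of `order()` as the row index. A sketch with mtcars:

```r
# Ascending by mpg
by_mpg <- mtcars[order(mtcars$mpg), ]

# Ascending by cyl, then descending by mpg within each cyl
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
```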

Calculating statistical summaries by group (Base R)

🎯 Calculate the average weight (wt) of a car for each gear type (gear) in mtcars
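Two possible base R solutions, offered as a sketch rather than the only way to do it:

```r
# tapply(): apply mean to wt, split by gear
avg_tapply <- tapply(mtcars$wt, mtcars$gear, mean)

# aggregate() with a formula: returns a data.frame
avg_agg <- aggregate(wt ~ gear, data = mtcars, FUN = mean)

avg_tapply
avg_agg
```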

🎯 Calculate the median weight (wt) of a car for each gear (gear) and engine (vs) type in mtcars
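One possible base R solution uses `aggregate()` with two grouping variables on the right-hand side of the formula:

```r
med_agg <- aggregate(wt ~ gear + vs, data = mtcars, FUN = median)
med_agg   # one row per observed gear/vs combination
```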

Data wrangling with Tidyverse

A grammar of data manipulation

  • dplyr is a core package in tidyverse
  • An earlier version of the ideas behind dplyr (first on CRAN on 2014-01-29) was implemented in plyr (first on CRAN on 2008-10-08).
  • The functions in dplyr have been evolving frequently, but dplyr v1.0.0 was released on CRAN on 2020-05-29, suggesting that the functions in dplyr are maturing and thus the user interface is unlikely to change.

dplyr “verbs”

  • The main functions of dplyr include:
    • arrange
    • select
    • mutate
    • rename
    • group_by
    • summarise
  • Notice that these functions are verbs.

dplyr structure

  • Functions in dplyr generally have the form:
verb(data, args)
  • The first argument data is a data.frame object.
  • What do you think the following will do?
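A few illustrative calls in the verb(data, args) form, with mtcars as an assumed example dataset (the object names on the left are made up):

```r
library(dplyr)

sorted  <- arrange(mtcars, wt)          # verb = arrange, data = mtcars, args = wt
two_col <- select(mtcars, mpg, cyl)     # verb = select
renamed <- rename(mtcars, weight = wt)  # verb = rename (new_name = old_name)
```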

Pipe operator

  • Almost all the tidyverse packages use the pipe operator %>% from the magrittr package.
  • R version 4.1.0 introduced a native pipe operator |> which is similar to %>% but with some differences.
  • x |> f(y) is the same as f(x, y).
  • x |> f(y) |> g(z) is the same as g(f(x, y), z).
  • When you see the pipe operator, read it as “and then”.
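The equivalences above can be verified with a small sketch (requires R 4.1.0 or later for `|>`; the vectors are arbitrary examples):

```r
x <- c(4, 9, 16)

# x |> f() |> g() is the same as g(f(x))
a <- x |> sqrt() |> sum()   # take x, and then sqrt, and then sum
b <- sum(sqrt(x))

# x |> f(y) is the same as f(x, y)
r1 <- c(1.234, 5.678) |> round(1)
r2 <- round(c(1.234, 5.678), 1)

identical(a, b)
identical(r1, r2)
```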

Tidyselect

  • Tidyverse packages generally use syntax from the tidyselect package for column selection.
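A sketch of some tidyselect helpers as used inside `dplyr::select()`, with mtcars as an assumed example:

```r
library(dplyr)

s_range  <- mtcars |> select(mpg:disp)            # consecutive columns
s_starts <- mtcars |> select(starts_with("d"))    # names starting with "d"
s_has    <- mtcars |> select(contains("ar"))      # names containing "ar"
s_not    <- mtcars |> select(!c(vs, am))          # everything except vs and am
s_where  <- mtcars |> select(where(is.numeric))   # columns satisfying a predicate
```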

Selection language

tibble (Tidyverse)

  • tibble is a modern reimagining of the data.frame object.
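A brief sketch of how a tibble differs in practice, converting mtcars with `as_tibble()` (the column name "model" for the former row names is an arbitrary choice):

```r
library(tibble)

# Row names are not kept by default; store them in a column instead
tb <- as_tibble(mtcars, rownames = "model")

class(tb)              # a tibble is still a data.frame
one <- tb[, "mpg"]     # single-column [ , ] subsetting stays a tibble
tb                     # printing shows only the first rows plus column types
```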

Subsetting by column (Tidyverse)
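A minimal sketch with dplyr, using mtcars as an assumed example:

```r
library(dplyr)

two_cols <- mtcars |> select(mpg, cyl)   # keep selected columns
one_col  <- mtcars |> select(mpg)        # still a data.frame
mpg_vec  <- mtcars |> pull(mpg)          # extract one column as a vector
```

Unlike `mtcars[, "mpg"]`, `select()` always returns a data frame; `pull()` is the explicit way to get a vector.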

Subsetting by row (Tidyverse)
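A minimal sketch with dplyr, using mtcars as an assumed example:

```r
library(dplyr)

four_cyl <- mtcars |> filter(cyl == 4)             # rows matching a condition
thrifty  <- mtcars |> filter(cyl == 4, mpg > 30)   # conditions are combined with AND
rows_2_4 <- mtcars |> slice(2:4)                   # rows by position
top3     <- mtcars |> slice_head(n = 3)            # first three rows
```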

Adding or modifying a column (Tidyverse)

mtcars |> mutate(gpm = 1/mpg)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
                           gpm
Mazda RX4           0.04761905
Mazda RX4 Wag       0.04761905
Datsun 710          0.04385965
Hornet 4 Drive      0.04672897
Hornet Sportabout   0.05347594
Valiant             0.05524862
Duster 360          0.06993007
Merc 240D           0.04098361
Merc 230            0.04385965
Merc 280            0.05208333
Merc 280C           0.05617978
Merc 450SE          0.06097561
Merc 450SL          0.05780347
Merc 450SLC         0.06578947
Cadillac Fleetwood  0.09615385
Lincoln Continental 0.09615385
Chrysler Imperial   0.06802721
Fiat 128            0.03086420
Honda Civic         0.03289474
Toyota Corolla      0.02949853
Toyota Corona       0.04651163
Dodge Challenger    0.06451613
AMC Javelin         0.06578947
Camaro Z28          0.07518797
Pontiac Firebird    0.05208333
Fiat X1-9           0.03663004
Porsche 914-2       0.03846154
Lotus Europa        0.03289474
Ford Pantera L      0.06329114
Ferrari Dino        0.05076142
Maserati Bora       0.06666667
Volvo 142E          0.04672897
mtcars |> mutate(engine = case_when(vs==0 ~ "V-shaped",
                                    vs==1 ~ "Straight"),
                 transmission = ifelse(am==0, "Automatic", "Manual"))
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
                      engine transmission
Mazda RX4           V-shaped       Manual
Mazda RX4 Wag       V-shaped       Manual
Datsun 710          Straight       Manual
Hornet 4 Drive      Straight    Automatic
Hornet Sportabout   V-shaped    Automatic
Valiant             Straight    Automatic
Duster 360          V-shaped    Automatic
Merc 240D           Straight    Automatic
Merc 230            Straight    Automatic
Merc 280            Straight    Automatic
Merc 280C           Straight    Automatic
Merc 450SE          V-shaped    Automatic
Merc 450SL          V-shaped    Automatic
Merc 450SLC         V-shaped    Automatic
Cadillac Fleetwood  V-shaped    Automatic
Lincoln Continental V-shaped    Automatic
Chrysler Imperial   V-shaped    Automatic
Fiat 128            Straight       Manual
Honda Civic         Straight       Manual
Toyota Corolla      Straight       Manual
Toyota Corona       Straight    Automatic
Dodge Challenger    V-shaped    Automatic
AMC Javelin         V-shaped    Automatic
Camaro Z28          V-shaped    Automatic
Pontiac Firebird    V-shaped    Automatic
Fiat X1-9           Straight       Manual
Porsche 914-2       V-shaped       Manual
Lotus Europa        Straight       Manual
Ford Pantera L      V-shaped       Manual
Ferrari Dino        V-shaped       Manual
Maserati Bora       V-shaped       Manual
Volvo 142E          Straight       Manual

Adding a row (Tidyverse)
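Two possible approaches, sketched on a made-up tibble `df`: `tibble::add_row()` for individual entries and `dplyr::bind_rows()` for stacking data frames.

```r
library(tibble)
library(dplyr)

df <- tibble(id = 1:3, name = c("a", "b", "c"))

at_end   <- df |> add_row(id = 4L, name = "d")               # appended at the bottom
at_start <- df |> add_row(id = 0L, name = "z", .before = 1)  # inserted before row 1

# bind_rows() matches columns by name, so the order in the new data does not matter
stacked <- bind_rows(df, tibble(name = "e", id = 5L))
```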

Sorting columns (Tidyverse)
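A sketch of column reordering with `relocate()` and `select()`, using mtcars as an assumed example:

```r
library(dplyr)

wt_first   <- mtcars |> relocate(wt)                        # moved to the front by default
gear_first <- mtcars |> relocate(gear, carb, .before = mpg)
alphabetic <- mtcars |> select(all_of(sort(names(mtcars)))) # alphabetical order
```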

Sorting rows (Tidyverse)
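A sketch with `arrange()`, using mtcars as an assumed example:

```r
library(dplyr)

by_mpg     <- mtcars |> arrange(mpg)             # ascending
by_mpg_rev <- mtcars |> arrange(desc(mpg))       # descending
by_cyl_mpg <- mtcars |> arrange(cyl, desc(mpg))  # cyl ascending, then mpg descending
```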

Calculating statistical summaries by group (Tidyverse)
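A sketch of the two mtcars summaries by group, written with `group_by()` and `summarise()` (the output column names `avg_wt` and `med_wt` are arbitrary):

```r
library(dplyr)

# Average weight for each gear type
by_gear <- mtcars |>
  group_by(gear) |>
  summarise(avg_wt = mean(wt))

# Median weight for each gear and engine type
by_gear_vs <- mtcars |>
  group_by(gear, vs) |>
  summarise(med_wt = median(wt), .groups = "drop")  # drop the grouping afterwards
```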

Applying a function to multiple columns
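One way to do this in dplyr is `across()`, used inside `summarise()` or `mutate()`. A minimal sketch with mtcars (requires R 4.1.0 or later for the `\(x)` shorthand):

```r
library(dplyr)

# Same summary for several columns
means <- mtcars |> summarise(across(c(mpg, wt), mean))

# Combine with grouping and a tidyselect helper
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(across(where(is.numeric), median))

# Transform several columns at once with an anonymous function
scaled <- mtcars |> mutate(across(c(disp, hp), \(x) x / max(x)))
```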

Summary

  • Data wrangling in Tidyverse just gives you a different flavour of how you can do things in Base R.
  • Tidyverse packages share a common design philosophy, grammar and data structure.
  • This trains your mental model to do data science tasks in a certain manner which may make it easier, faster, and/or fun for you to do these tasks.

dplyr cheatsheet

Exercise time

15:00